BONN: Bayesian Optimized Binary Neural Network
[Figure: stacked Conv-BN-ReLU blocks annotated with the cross-entropy loss, the Bayesian feature loss, the Bayesian pruning loss, and the Bayesian kernel loss.]
FIGURE 3.20
By considering the prior distributions of the kernels and features in the Bayesian frame-
work, we derive three new Bayesian losses to optimize 1-bit CNNs. The Bayesian kernel
loss improves the layerwise kernel distribution of each convolution layer, the Bayesian fea-
ture loss introduces intraclass compactness to alleviate the disturbance induced by the
quantization process, and the Bayesian pruning loss centralizes channels that follow the same
Gaussian distribution for pruning. The Bayesian feature loss is applied only to the fully
connected layer.
learning are intrinsically inherited during model quantization and pruning. The proposed
losses can also comprehensively supervise the 1-bit CNN training process concerning kernel
and feature distributions. Finally, a new direction in 1-bit CNN pruning is explored to
further improve the compressed model's applicability in practical settings.
3.7.1 Bayesian Formulation for Compact 1-Bit CNNs
The state-of-the-art methods [128, 199, 77] learn 1-bit CNNs by involving optimization in
continuous and discrete spaces. In particular, training a 1-bit CNN involves three steps:
a forward pass, a backward pass, and a parameter update through gradient calculation.
Binarized weights (x̂) are considered only during the forward pass (inference) and gradient
calculation. After updating the parameters, we have the full-precision weights (x). As
revealed in [128, 199, 77], how to connect x̂ with x is the key to determining the performance
of a quantized network. In this chapter, we propose to solve it in a probabilistic framework
to learn optimal 1-bit CNNs.
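To make these three steps concrete, the following is a minimal PyTorch-style sketch, assuming sign binarization with a straight-through estimator for the gradient; the class names (BinarizeSTE, BinaryConv2d) and the toy training step are illustrative assumptions, not the implementation used in this chapter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator (STE).

    Forward: x_hat = sign(x).
    Backward: pass the gradient through where |x| <= 1, zero it elsewhere.
    """

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)  # maps weights to {-1, 0, +1}; a common binarization choice

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


class BinaryConv2d(nn.Conv2d):
    """Convolution whose weights are binarized only in the forward pass;
    the full-precision weights x are kept and updated by the optimizer."""

    def forward(self, input):
        w_hat = BinarizeSTE.apply(self.weight)  # x_hat used for inference and gradients
        return F.conv2d(input, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)


# One training step: forward with x_hat, backward, then update full-precision x.
layer = BinaryConv2d(3, 16, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)

images = torch.randn(8, 3, 32, 32)
targets = torch.randn(8, 16, 32, 32)

loss = F.mse_loss(layer(images), targets)  # forward pass uses binarized weights
optimizer.zero_grad()
loss.backward()                            # gradients reach full-precision weights via the STE
optimizer.step()                           # full-precision weights x are updated
```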
3.7.2 Bayesian Learning Losses
Bayesian kernel loss: Given a network weight parameter x, its quantized code should
be as close to its original (full precision) code as possible, so that the quantization error is
minimized. We then define:
y = w^{-1} ◦ x̂ − x,    (3.97)
where x, x̂ ∈ R^n are the full precision and quantized vectors, respectively, w ∈ R^n denotes
the learned vector to reconstruct x, ◦ represents the Hadamard product, and y ∼ G(0, ν)